Relationship between cray.nnf.node.drain taint and Storage resource. #188

Merged
roehrich-hpe merged 4 commits into main from taint-storage on Aug 1, 2024


roehrich-hpe
Contributor

No description provided.

@roehrich-hpe
Contributor Author

See NearNodeFlash/nnf-sos#341

Collaborator

@jameshcorbett jameshcorbett left a comment

🎉

The Storage status will be "Drained".

Document how to use the Storage's .spec.state to manually disable a node.

Signed-off-by: Dean Roehrich <[email protected]>
@jameshcorbett
Collaborator

After reading this, my understanding is that disabling and draining a rabbit node are the same from Flux's perspective, because applying the taint causes the Storage to go to .status.state == Disabled, which is the only thing Flux checks. Can you provide some guidance on when you would want to drain vs. disable?

@roehrich-hpe
Contributor Author

After reading this, my understanding is that disabling and draining a rabbit node is the same from Flux's perspective, because applying the taint causes the storage to go to .status.state == Disabled, which is the only thing Flux checks. Can you provide some guidance on when you would want to drain vs disable?

Today I changed the PR so that status.state==Drained when the drain taint is applied.

Today's update to the doc begins by describing how to disable a node by setting spec.state=Disabled. It explains that any currently active jobs are unaffected, but that the WLM won't schedule more jobs on the node. This "Disabled" mechanism is likely to be the most commonly used action.
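As a rough illustration of the disable step, the change to the Storage resource can be expressed as a JSON merge-patch body. This is only a sketch: the helper name `storage_state_patch` is hypothetical, and how the patch is delivered to the cluster (CLI or client library) is not covered here. Only the `spec.state` field and the `Disabled` value come from the discussion above.

```python
import json


def storage_state_patch(state: str) -> str:
    """Build a JSON merge-patch body that sets the Storage's spec.state.

    Hypothetical helper for illustration; only the spec.state field and
    the "Disabled" value are taken from the PR discussion.
    """
    return json.dumps({"spec": {"state": state}})


# Patch body an admin would apply to stop the WLM from scheduling
# new jobs on the node. Active jobs are described as unaffected.
disable_body = storage_state_patch("Disabled")
print(disable_body)
```

Keeping the patch limited to `spec.state` leaves every other field of the Storage resource untouched, which matches the intent of a manual disable.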

Then it goes into the drain part, describing the cray.nnf.node.drain taint. The taint would be used only after you have removed the jobs from that rabbit (preferably) and have some reason to also remove the NNF software from it. I'd like this to happen before a rabbit is powered off and pulled out of the cabinet, but we haven't been doing that and the cluster seems to be doing fine anyway, other than the "Terminating" pods it leaves behind (harmless but annoying noise).

If an admin applied this taint before power-off, there would be no "Terminating" pods lying around for that rabbit. And after a new/same rabbit is put back in its place, the NNF software wouldn't jump back onto it until the taint has been removed. The taint can be removed at any time, from immediately after the node is powered off up to some time after the new/same rabbit is powered back on.

@behlendorf, ^
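The relationship described in this comment (manual disable through spec.state, drain through the taint, and the WLM reading status.state) can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the controller's actual logic: the function names are hypothetical, the "Enabled" value is assumed, status.state is assumed to mirror spec.state when no taint is present, and the drain taint is assumed to take precedence over a manual disable. The taint entries use the key/value/effect shape of Kubernetes node taints; the specific value and effect shown are illustrative.

```python
DRAIN_TAINT_KEY = "cray.nnf.node.drain"  # taint key from the PR title


def expected_status_state(spec_state: str, node_taints: list[dict]) -> str:
    """Model the Storage status.state for a rabbit node.

    Assumptions (not confirmed by the PR text): the drain taint wins over
    spec.state, and without the taint the status mirrors spec.state.
    """
    drained = any(t.get("key") == DRAIN_TAINT_KEY for t in node_taints)
    if drained:
        return "Drained"
    return spec_state


def wlm_may_schedule(status_state: str) -> bool:
    """Assumed WLM rule: no new jobs on a Disabled or Drained Storage."""
    return status_state not in ("Disabled", "Drained")


# Illustrative taint entry; the value and effect are assumptions.
drain_taint = {"key": DRAIN_TAINT_KEY, "value": "true", "effect": "NoSchedule"}

for spec_state, taints in [
    ("Enabled", []),
    ("Disabled", []),
    ("Enabled", [drain_taint]),
]:
    status = expected_status_state(spec_state, taints)
    print(spec_state, bool(taints), status, wlm_may_schedule(status))
```

In this model both "Disabled" and "Drained" block new scheduling, so the difference between them is operational: the drain taint additionally concerns removing the NNF software from the node.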

@jameshcorbett
Collaborator

This would be used only after you have removed the jobs from that rabbit (preferably) and have some reason to also remove the NNF software from it. I'd like to have this happen before a rabbit is powered off and pulled out of the cabinet, but we haven't been doing this and the cluster seems to be doing fine anyway, other than that it leaves the "Terminating" pods behind (harmless, annoying noise).

If an admin used this taint before power-off it would mean you won't have "Terminating" pods lying around for that rabbit. And after a new/same rabbit is put back in its place, the NNF software wouldn't jump back on it unless the taint has been removed. The taint can be removed at any time, from immediately after the node is powered off up to some time after the new/same rabbit is powered back on.

I wonder if you might add this to the doc? It seems very helpful.

Signed-off-by: Dean Roehrich <[email protected]>
roehrich-hpe merged commit 6fb9eec into main on Aug 1, 2024
1 check passed
roehrich-hpe deleted the taint-storage branch on August 1, 2024 at 20:08